AITopics | multimodal question

The Ouroboros of Benchmarking: Reasoning Evaluation in an Era of Saturation

Deveci, İbrahim Ethem, Ataman, Duygu

arXiv.org Artificial Intelligence Nov-4-2025

The rapid rise of Large Language Models (LLMs) and Large Reasoning Models (LRMs) has been accompanied by an equally rapid increase in the number of benchmarks used to assess them. However, because model competence has improved through scaling and novel training advances, and because many of these datasets have likely been included in pre- or post-training data, results become saturated, driving a continuous need for new and more challenging replacements. In this paper, we discuss whether surpassing a benchmark truly demonstrates reasoning ability or whether we are simply tracking numbers divorced from the capabilities we claim to measure. We present an investigation focused on three model families, OpenAI, Anthropic, and Google, and how their reasoning capabilities across different benchmarks evolve over the years. We also analyze performance trends over the years across different reasoning tasks and discuss the current situation of benchmarking and remaining challenges. By offering a comprehensive overview of benchmarks and reasoning tasks, our work aims to serve as a first reference to ground future research in reasoning evaluation and model development.

benchmark, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2511.01365

Country:

Europe (0.93)
North America > United States > Minnesota (0.28)

Genre: Research Report (0.82)

Industry: Education (0.96)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

BEARCUBS: A benchmark for computer-using web agents

Song, Yixiao, Thai, Katherine, Pham, Chau Minh, Chang, Yapei, Nadaf, Mazin, Iyyer, Mohit

arXiv.org Artificial Intelligence Mar-10-2025

Modern web agents possess computer use abilities that allow them to interact with webpages by sending commands to a virtual keyboard and mouse. While such agents have considerable potential to assist human users with complex tasks, evaluating their capabilities in real-world settings poses a major challenge. To this end, we introduce BEARCUBS, a "small but mighty" benchmark of 111 information-seeking questions designed to evaluate a web agent's ability to search, browse, and identify factual information from the web. Unlike prior web agent benchmarks, solving BEARCUBS requires (1) accessing live web content rather than synthetic or simulated pages, which captures the unpredictability of real-world web interactions; and (2) performing a broad range of multimodal interactions (e.g., video understanding, 3D navigation) that cannot be bypassed via text-based workarounds. Each question in BEARCUBS has a corresponding short, unambiguous answer and a human-validated browsing trajectory, allowing for transparent evaluation of agent performance and strategies. A human study confirms that BEARCUBS questions are solvable but non-trivial (84.7% human accuracy), revealing search inefficiencies and domain knowledge gaps as common failure points. By contrast, state-of-the-art computer-using agents underperform, with the best-scoring system (OpenAI's Operator) reaching only 24.3% accuracy. These results highlight critical areas for improvement, including reliable source selection and more powerful multimodal capabilities. To facilitate future research, BEARCUBS will be updated periodically to replace invalid or contaminated questions, keeping the benchmark fresh for future generations of web agents.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2503.07919

Country:

Asia > Thailand > Bangkok > Bangkok (0.04)
North America > United States > Florida > Miami-Dade County > Miami (0.04)
North America > United States > Maryland > Prince George's County > College Park (0.04)
(3 more...)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Communications > Web (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.53)

VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering

Wang, Yanling, Zhao, Yihan, Chen, Xiaodong, Guo, Shasha, Liu, Lixin, Li, Haoyang, Xiao, Yong, Zhang, Jing, Li, Qi, Xu, Ke

arXiv.org Artificial Intelligence Mar-9-2025

Large vision-language models (LVLMs) have demonstrated remarkable achievements, yet the generation of non-factual responses remains prevalent in fact-seeking question answering (QA). Current multimodal fact-seeking benchmarks primarily focus on comparing model outputs to ground truth answers, providing limited insights into the performance of modality-specific modules. To bridge this gap, we introduce VisualSimpleQA, a multimodal fact-seeking benchmark with two key features. First, it enables streamlined and decoupled evaluation of LVLMs in visual and linguistic modalities. Second, it incorporates well-defined difficulty criteria to guide human annotation and facilitates the extraction of a challenging subset, VisualSimpleQA-hard. Experiments on 15 LVLMs show that even state-of-the-art models such as GPT-4o achieve merely 60%+ correctness in multimodal fact-seeking QA on VisualSimpleQA and 30%+ on VisualSimpleQA-hard. Furthermore, the decoupled evaluation across these models highlights substantial opportunities for improvement in both visual and linguistic modules. The dataset is available at https://huggingface.co/datasets/WYLing/VisualSimpleQA.

large language model, machine learning, visualsimpleqa, (20 more...)

arXiv.org Artificial Intelligence

2503.06492

Country:

Asia > Singapore (0.04)
Asia > China (0.04)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Edu-Values: Towards Evaluating the Chinese Education Values of Large Language Models

Zhang, Peiyi, Zhang, Yazhou, Wang, Bo, Rong, Lu, Qin, Jing

arXiv.org Artificial Intelligence Sep-19-2024

With the recent evolution of large language models (LLMs), concerns about aligning such models with human values have grown. Previous research has primarily focused on assessing LLMs' performance in terms of the Helpful, Honest, Harmless (3H) basic principles, while often overlooking their alignment with educational values in the Chinese context. To fill this gap, we present Edu-Values, the first Chinese education values evaluation benchmark designed to measure LLMs' alignment ability across seven dimensions: professional ideology, cultural literacy, educational knowledge and skills, education laws and regulations, teachers' professional ethics, basic competencies, and subject knowledge. We meticulously design and compile 1,418 questions, including multiple-choice, multi-modal question answering, subjective analysis, adversarial prompts, and questions on traditional Chinese culture. We conduct both human evaluation and automatic evaluation over 11 state-of-the-art (SoTA) LLMs, and highlight three main findings: (1) due to differences in educational culture, Chinese LLMs significantly outperform English LLMs, with Qwen 2 ranking first with a score of 81.37; (2) LLMs perform well in subject knowledge and teaching skills but struggle with teachers' professional ethics and basic competencies; (3) LLMs excel at multiple-choice questions but perform poorly on subjective analysis and multi-modal tasks. This demonstrates the effectiveness and potential of the proposed benchmark. Our dataset is available at https://github.com/zhangpeii/Edu-Values.git.

dimension, llm, multimodal question, (14 more...)

arXiv.org Artificial Intelligence

2409.12739

Country:

North America > United States (0.04)
Asia > China > Tianjin Province > Tianjin (0.04)
Asia > China > Hong Kong (0.04)

Genre: Instructional Material (1.00)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)

SPRING: Situated Conversation Agent Pretrained with Multimodal Questions from Incremental Layout Graph

Long, Yuxing, Hui, Binyuan, Ye, Fulong, Li, Yanyang, Han, Zhuoxin, Yuan, Caixia, Li, Yongbin, Wang, Xiaojie

arXiv.org Artificial Intelligence Jan-5-2023

Existing multimodal conversation agents have shown impressive abilities to locate absolute positions or retrieve attributes in simple scenarios, but they fail to perform well when complex relative positions and information alignments are involved, which poses a bottleneck in response quality. In this paper, we propose a Situated Conversation Agent Pretrained with Multimodal Questions from INcremental Layout Graph (SPRING), which can reason over multi-hop spatial relations and connect them with visual attributes in crowded situated scenarios. Specifically, we design two types of Multimodal Question Answering (MQA) tasks to pretrain the agent. All QA pairs utilized during pretraining are generated from novel Incremental Layout Graphs (ILG). QA pair difficulty labels automatically annotated by ILG are used to promote MQA-based Curriculum Learning. Experimental results verify SPRING's effectiveness, showing that it significantly outperforms state-of-the-art approaches on both SIMMC 1.0 and SIMMC 2.0 datasets.

artificial intelligence, natural language, question answering, (4 more...)

arXiv.org Artificial Intelligence

2301.01949

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.80)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.80)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.73)

MultiModalQA: Complex Question Answering over Text, Tables and Images

Talmor, Alon, Yoran, Ori, Catav, Amnon, Lahav, Dan, Wang, Yizhong, Asai, Akari, Ilharco, Gabriel, Hajishirzi, Hannaneh, Berant, Jonathan

arXiv.org Artificial Intelligence Apr-13-2021

When answering complex questions, people can seamlessly combine information from visual, textual and tabular sources. While interest in models that reason over multiple pieces of evidence has surged in recent years, there has been relatively little work on question answering models that reason across multiple modalities. We present MultiModalQA (MMQA): a challenging question answering dataset that requires joint reasoning over text, tables and images. We create MMQA using a new framework for generating complex multi-modal questions at scale, harvesting tables from Wikipedia, and attaching images and text paragraphs using entities that appear in each table. We then define a formal language that allows us to take questions that can be answered from a single modality, and combine them to generate cross-modal questions. Last, crowdsourcing workers take these automatically generated questions and rephrase them into more fluent language. When presented with complex questions, people often do not know in advance what source(s) of information are relevant for answering them. In general scenarios, these sources can encompass multiple modalities, be it paragraphs of text, structured tables, images or combinations of those. For instance, a user might ponder "When was the famous painting with two touching fingers completed?" Answering this question is made possible by integrating information across both the textual and visual modalities. Recently, there has been substantial interest in question answering (QA) models that reason over multiple pieces of evidence (multi-hop questions (Yang et al., 2018; Talmor & Berant, 2018; Welbl et al., 2017)). In most prior work, the question is phrased in natural language and the answer is in a context, which may be a paragraph (Rajpurkar, 2016), a table (Pasupat & Liang, 2015), or an image (Antol et al., 2015). However, there has been relatively little work on answering questions that require integrating information across modalities.

modality, reasoning, wikientity, (17 more...)

arXiv.org Artificial Intelligence

2104.06039

Country:

Europe > Germany > Baden-Württemberg (0.04)
Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
North America > United States > Ohio (0.04)
(5 more...)

Genre: Research Report (0.40)

Industry:

Media (0.68)
Leisure & Entertainment > Sports (0.46)
Government > Regional Government (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
